Poster: pabouk Date: Jan 8, 2009 2:25am
Forum: web Subject: The archive returns wrong Content-Type HTTP header

Hello,

sorry for posting this again, but this is a bug report and I got no reply.

Several times I have seen the web archive return a bad "Content-Type" HTTP header with the wrong character set. Examples:

$ wget -S http://web.archive.org/web/20030524081559/http://www.iriverjapan.com/product.php?product=iHP-100
--2008-07-28 15:23:00--  http://web.archive.org/web/20030524081559/http://www.iriverjapan.com/product.php?product=iHP-100
Resolving web.archive.org... 207.241.227.154
Connecting to web.archive.org|207.241.227.154|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Mon, 28 Jul 2008 13:23:01 GMT
Server: Apache/2.2.4 (Ubuntu) PHP/5.2.3-1ubuntu6 mod_perl/2.0.2 Perl/v5.8.8
X-Powered-By: PHP/4.2.3
Content-Type: text/html; charset=UTF-8
Connection: close
Length: unspecified [text/html]

$ wget -S http://web.archive.org/web/20050518010425/http://www.didaktik.cz/pocitace_didaktik/didaktik_8.htm
--2008-07-28 15:33:53--  http://web.archive.org/web/20050518010425/http://www.didaktik.cz/pocitace_didaktik/didaktik_8.htm
Resolving web.archive.org... 207.241.227.154
Connecting to web.archive.org|207.241.227.154|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Mon, 28 Jul 2008 13:33:54 GMT
Server: Apache/2.2.4 (Ubuntu) PHP/5.2.3-1ubuntu6 mod_perl/2.0.2 Perl/v5.8.8
Accept-Ranges: bytes
ETag: "306df66a864ec51:1695"
Last-Modified: Sun, 01 May 2005 19:46:07 GMT
Content-Length: 8750
Content-Type: text/html; charset=UTF-8
Connection: close
Length: 8750 (8.5K) [text/html]

In the first case the archive returns "Content-Type: text/html; charset=UTF-8" although the archived page is in the "x-sjis" charset, as indicated in its HTML meta tags.

In the second case the archive again returns "Content-Type: text/html; charset=UTF-8" (does it always?), although the page is in the "Windows-1250" charset. (Unfortunately, the meta tags do not indicate this.)
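To illustrate why the wrong label matters, here is a minimal Python sketch. The sample strings are made up for illustration and are not taken from the archived pages, and I am assuming Python's "shift_jis" and "cp1250" codecs correspond to the "x-sjis" and "Windows-1250" charsets mentioned above. It shows that bytes written in those charsets do not survive being decoded as UTF-8:

```python
def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes with the given charset; invalid sequences become U+FFFD."""
    return raw.decode(charset, errors="replace")


# Hypothetical sample text (NOT from the archived pages).
czech = "Počítače"
japanese = "製品"

# What the origin servers would have sent on the wire
# (assumption: cp1250 ~ Windows-1250, shift_jis ~ x-sjis).
raw_cz = czech.encode("cp1250")
raw_ja = japanese.encode("shift_jis")

# Decoding with the charset the archive asserts (UTF-8) ...
wrong_cz = decode_as(raw_cz, "utf-8")
wrong_ja = decode_as(raw_ja, "utf-8")

# ... versus decoding with the charset the page was really written in.
right_cz = decode_as(raw_cz, "cp1250")
right_ja = decode_as(raw_ja, "shift_jis")

if __name__ == "__main__":
    print(repr(wrong_cz), repr(right_cz))
    print(repr(wrong_ja), repr(right_ja))
```

A browser that trusts the "charset=UTF-8" header does roughly what the "wrong" lines do, so accented or Japanese characters show up as replacement characters or other garbage.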

In both cases, omitting the "charset=UTF-8" part would give a better result. Do you know why the archive wrongly asserts the UTF-8 character set in the HTTP header? Unfortunately, the HTTP header overrides the HTML meta tags. Does the archive store the original HTTP headers?
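The precedence I mean can be sketched like this in Python. It is a simplified model, not the full algorithm browsers use (it ignores byte-order marks and content sniffing, for example), and the function names are my own, purely for illustration:

```python
import re
from typing import Optional

# Matches charset=... in a Content-Type value or inside a <meta> tag.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    if not content_type:
        return None
    match = _CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(html: bytes) -> Optional[str]:
    """Return the charset declared in a <meta> tag near the start of the page."""
    # Assumption: the declaration is ASCII and sits in the first 1024 bytes.
    head = html[:1024].decode("ascii", errors="ignore")
    for tag in re.findall(r"<meta[^>]*>", head, re.IGNORECASE):
        match = _CHARSET_RE.search(tag)
        if match:
            return match.group(1).lower()
    return None


def effective_charset(content_type: Optional[str], html: bytes) -> Optional[str]:
    """The HTTP header wins; the meta tag is consulted only if the header is silent."""
    return charset_from_header(content_type) or charset_from_meta(html)


# Hypothetical page fragment resembling the first example.
page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=x-sjis"></head>'

if __name__ == "__main__":
    print(effective_charset("text/html; charset=UTF-8", page))
    print(effective_charset("text/html", page))
```

With the header as the archive sends it, the meta tag is never consulted. With a plain "Content-Type: text/html", the page's own declaration is used, which is why omitting the charset would be the safer default.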

Thank you.

PS: It seems that the same problem reported here:
http://www.archive.org/iathreads/post-view.php?id=186672 has been resolved, and the Wayback Machine no longer returns UTF-8 in the Content-Type header there. Could you please correct the problem on other pages as well?

Poster: pabouk Date: Jan 8, 2009 2:49am
Forum: web Subject: Re: The archive returns wrong Content-Type HTTP header

It seems that the problem has been resolved. What a coincidence! Several hours ago the problem was definitely still there.

Thank you!